Correlation and Linear Regression

University of San Francisco

Matt Meister

Relationships between multiple variables

We’ve just started to talk about relationships between multiple variables.

  • Do different advertisements lead to different interest?
  • How do different features of shoes lead to different ratings?
  • Do stores closer to population centers sign up more customers than stores farther away?

Why might we care a lot about relationships between multiple variables?

  • Often, we can manipulate factors related to them.
  • etc.

Relationships between multiple variables

Identifying multivariate relationships helps us as marketers.

  • e.g., If stores in population centers do better than ones farther away, what is an obvious implication?
    • Move people closer to our stores
      • No!
    • Place new stores near population centers

In this module, we focus on:

  • Understanding the relationships between pairs of variables (multivariate data)
  • Visualizing these relationships
  • Computing statistics that describe their associations
    • Correlation coefficients
    • Regression coefficients

Relationships between multiple variables

I’ve taken real data from rei.com and added a few simulated variables for class.

Download it from Canvas, and load it into R!

library(tidyverse) # loads ggplot2 and dplyr, which we use for plots and summaries below
reiData <- read.csv('rei_products.csv')

Our Setting

  • For today’s class, we are going to take the role of a market analyst at REI
  • Specifically, our job is to identify new types of shoes that we should stock
  • We are going to use data from REI stores to make this decision

These Data

  • How many shoes?
    • 1022
  • How many brands?
    • 54
  • What’s a good way to look at these data?
  • Describe these variables:
    • price
    • avg_rating
    • sales
    • mens
    • avg_size
    • avg_running
    • cushion
    • weight
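
A quick sketch in R of how you might answer these questions (the brand column name is an assumption about the file; adjust it to whatever the data actually use):

nrow(reiData)                  # how many shoes (rows)?
length(unique(reiData$brand))  # how many brands? ('brand' is an assumed column name)
str(reiData)                   # one good way to look at these data
summary(reiData[, c('price', 'avg_rating', 'sales', 'mens',
                    'avg_size', 'avg_running', 'cushion', 'weight')])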

How Can We Predict sales?

  • We want to know what variables impact sales
  • If our goal is to increase future sales, these variables are worth knowing
  • Two general types of predictor variables
    • Groups (discrete)
    • Continuous
  • Luckily, we can test them the same way

Continuous Predictors

With continuous predictors (independent variables), we are seeing if some numeric variable predicts outcomes

Is there a meaningful relationship between these two variables?

  • Does x cause y?

Outline for Today

  1. Data = Model + Error
  2. Why is the mean our best guess (model) when we know nothing else?
  3. Using lm() to find better (linear) models
  4. Testing whether that model is better

Data = Model + Error

  • The whole point of this is to find better models to predict our data
  • “model” can be a single guess, a guess at a relationship, or something else
  • Regardless, we are trying to minimize Error
    • Which is the sum of squared error (sketched below)
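
As a preview, here is that idea in a few lines of R with made-up numbers (we will do the same thing with the real data on the next slides):

y     <- c(10, 12, 9)  # hypothetical data
model <- mean(y)       # a single-number model: our guess for every observation
sum((y - model)^2)     # Error: the sum of squared differences between data and model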

Continuous Predictors

In our REI data, our outcome is sales

Continuous Predictors

In our REI data, our outcome is sales

  • Without knowing anything else, what is our best guess at the sales a new shoe will bring?
    • The mean of sales (10.05)

Continuous Predictors

In our REI data, our outcome is sales

  • Without knowing anything else, what is our best guess at the sales a new shoe will bring?
    • The mean of sales (10.05)
  • How wrong will we be normally?
    • The standard deviation of sales (0.92)

Continuous Predictors

In our REI data, our outcome is sales

  • Without knowing anything else, what is our best guess at the sales a new shoe will bring?
    • The mean of sales (10.05)
  • How wrong will we be normally?
    • The standard deviation of sales (0.92)
  • How could we get better guesses?
    • Use some other variable!
      • Like weight

Continuous Predictors

In our REI data, our outcome is sales

  • We can use weight to predict sales better
    • Why would we care whether weight predicts sales?
    • Because we can make new shoes with different weights!
  • What is one way we have looked at continuous predictors so far (in R)?

Continuous Predictors: Scatterplots

ggplot(data = reiData,
       aes(x = weight, y = sales)) +
  geom_point(alpha = .5) +
  geom_smooth(method = 'lm', se = F, size = 1) +
  theme_bw()

Continuous Predictors

Is this scatterplot showing a strong relationship?

  • Scatterplots provide a lot of visual information
  • But they’re very ~vibesy~
    • Not precise
  • And when there are more than a few variables, they’re a mess
  • Therefore, it’s nice to also have a number

Continuous Predictors

Are these scatterplots showing strong relationships?

Two ways to know with numbers:

  • Regression
    • Using lm() in R
    • Bueno
  • Correlations:
    • Using cor() in R

Linear Regression

Regression does the following:

  • It helps us to understand how two (or more) variables vary together
  • We can make clear predictions from regression coefficients
  • We can test regression coefficients against a null hypothesis

Linear Regression

So what is linear regression?

  • Why did I say that the mean of a variable is our best guess at any given value of that variable?
  • Let’s try some guesses of reiData$sales

To understand this, let’s look at the first 20 rows of reiData

reiDataSmall <- head(reiData, 20)

And I’m going to only keep the columns I think are useful for this example

reiDataSmall <- reiDataSmall[, c('weight', 'sales')]
reiDataSmall
     weight     sales
1  3.160536 10.069211
2  4.970014 10.131396
3  6.759033  8.856539
4  3.068526 10.783086
5  5.254373 11.149486
6  4.769999 10.239619
7  3.831546 10.078726
8  3.805244 10.128758
9  3.172416 10.696915
10 2.901195  9.825507
11 2.461316 10.569462
12 2.737338 10.440997
13 2.982250  9.584194
14 5.910042 11.216148
15 5.154571  9.845226
16 6.090791  9.356569
17 5.153109 10.137007
18 4.059393  9.513671
19 5.565751  9.480360
20 2.473582 11.151987

Why is the mean the best guess?

If we are interested in the sales column, let’s see what it looks like:

ggplot(data = reiDataSmall, # Scatterplot
       aes(x = 0, 
           # ^ This is just a hacky way to get everything on the same spot on the x-axis
           y = sales)) +
  geom_point() +
  theme_bw()

Why is the mean the best guess?

  • We are going to try to make the BEST guess we can at each value of sales
  • The best guess will be the one with the lowest possible squared error (residuals) between:
    • Our guess and the real data
  • Why squared?
    • This makes every error positive
      • Penalizing us for guessing too high and too low
    • This penalizes us for very wrong guesses more than slightly wrong ones
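
A two-line illustration of both points, with hypothetical errors:

errors <- c(-0.5, 0.5, 2)  # guessing a bit low, a bit high, and very wrong
errors^2                   # 0.25, 0.25, 4: every error is positive, and the big miss costs far more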

Why is the mean the best guess?

I’ll start by guessing the mean, or average

guess <- mean(reiDataSmall$sales)
reiDataSmall$guess <- guess
reiDataSmall
     weight     sales    guess
1  3.160536 10.069211 10.16274
2  4.970014 10.131396 10.16274
3  6.759033  8.856539 10.16274
4  3.068526 10.783086 10.16274
5  5.254373 11.149486 10.16274
6  4.769999 10.239619 10.16274
7  3.831546 10.078726 10.16274
8  3.805244 10.128758 10.16274
9  3.172416 10.696915 10.16274
10 2.901195  9.825507 10.16274
11 2.461316 10.569462 10.16274
12 2.737338 10.440997 10.16274
13 2.982250  9.584194 10.16274
14 5.910042 11.216148 10.16274
15 5.154571  9.845226 10.16274
16 6.090791  9.356569 10.16274
17 5.153109 10.137007 10.16274
18 4.059393  9.513671 10.16274
19 5.565751  9.480360 10.16274
20 2.473582 11.151987 10.16274

Why is the mean the best guess?

How wrong was I?

  • Subtract the real data from the guess
reiDataSmall$how_wrong <- reiDataSmall$guess - reiDataSmall$sales
  • Square them to make them all positive
reiDataSmall$how_wrong2 <- (reiDataSmall$guess - reiDataSmall$sales) ^ 2
reiDataSmall
     weight     sales    guess   how_wrong   how_wrong2
1  3.160536 10.069211 10.16274  0.09353188 0.0087482124
2  4.970014 10.131396 10.16274  0.03134738 0.0009826582
3  6.759033  8.856539 10.16274  1.30620416 1.7061693156
4  3.068526 10.783086 10.16274 -0.62034322 0.3848257155
5  5.254373 11.149486 10.16274 -0.98674322 0.9736621774
6  4.769999 10.239619 10.16274 -0.07687579 0.0059098874
7  3.831546 10.078726 10.16274  0.08401714 0.0070588794
8  3.805244 10.128758 10.16274  0.03398506 0.0011549846
9  3.172416 10.696915 10.16274 -0.53417160 0.2853392931
10 2.901195  9.825507 10.16274  0.33723672 0.1137286034
11 2.461316 10.569462 10.16274 -0.40671836 0.1654198253
12 2.737338 10.440997 10.16274 -0.27825397 0.0774252726
13 2.982250  9.584194 10.16274  0.57854943 0.3347194427
14 5.910042 11.216148 10.16274 -1.05340448 1.1096609896
15 5.154571  9.845226 10.16274  0.31751697 0.1008170242
16 6.090791  9.356569 10.16274  0.80617426 0.6499169379
17 5.153109 10.137007 10.16274  0.02573597 0.0006623402
18 4.059393  9.513671 10.16274  0.64907221 0.4212947353
19 5.565751  9.480360 10.16274  0.68238311 0.4656467078
20 2.473582 11.151987 10.16274 -0.98924365 0.9786030031

Why is the mean the best guess?

The column how_wrong2 is equal to (guess - sales)^2, i.e., the guess minus the real data, squared.

How wrong was I in total?

sum(reiDataSmall$how_wrong2)
[1] 7.791746

Why is the mean the best guess?

To see if that’s the best guess we could have made:

  • Let’s try some others!

Why is the mean the best guess?

I’ll guess something else

guess <- 11
reiDataSmall$guess <- guess
reiDataSmall
     weight     sales guess   how_wrong   how_wrong2
1  3.160536 10.069211    11  0.09353188 0.0087482124
2  4.970014 10.131396    11  0.03134738 0.0009826582
3  6.759033  8.856539    11  1.30620416 1.7061693156
4  3.068526 10.783086    11 -0.62034322 0.3848257155
5  5.254373 11.149486    11 -0.98674322 0.9736621774
6  4.769999 10.239619    11 -0.07687579 0.0059098874
7  3.831546 10.078726    11  0.08401714 0.0070588794
8  3.805244 10.128758    11  0.03398506 0.0011549846
9  3.172416 10.696915    11 -0.53417160 0.2853392931
10 2.901195  9.825507    11  0.33723672 0.1137286034
11 2.461316 10.569462    11 -0.40671836 0.1654198253
12 2.737338 10.440997    11 -0.27825397 0.0774252726
13 2.982250  9.584194    11  0.57854943 0.3347194427
14 5.910042 11.216148    11 -1.05340448 1.1096609896
15 5.154571  9.845226    11  0.31751697 0.1008170242
16 6.090791  9.356569    11  0.80617426 0.6499169379
17 5.153109 10.137007    11  0.02573597 0.0006623402
18 4.059393  9.513671    11  0.64907221 0.4212947353
19 5.565751  9.480360    11  0.68238311 0.4656467078
20 2.473582 11.151987    11 -0.98924365 0.9786030031

Why is the mean the best guess?

How wrong was I?

  • Subtract the real data from the guess
reiDataSmall$how_wrong <- reiDataSmall$guess - reiDataSmall$sales
  • Square them to make them all positive
reiDataSmall$how_wrong2 <- (reiDataSmall$guess - reiDataSmall$sales) ^ 2
reiDataSmall
     weight     sales guess  how_wrong how_wrong2
1  3.160536 10.069211    11  0.9307886 0.86636747
2  4.970014 10.131396    11  0.8686041 0.75447313
3  6.759033  8.856539    11  2.1434609 4.59442468
4  3.068526 10.783086    11  0.2169135 0.04705148
5  5.254373 11.149486    11 -0.1494865 0.02234620
6  4.769999 10.239619    11  0.7603810 0.57817920
7  3.831546 10.078726    11  0.9212739 0.84874557
8  3.805244 10.128758    11  0.8712418 0.75906230
9  3.172416 10.696915    11  0.3030852 0.09186061
10 2.901195  9.825507    11  1.1744935 1.37943490
11 2.461316 10.569462    11  0.4305384 0.18536330
12 2.737338 10.440997    11  0.5590028 0.31248410
13 2.982250  9.584194    11  1.4158062 2.00450713
14 5.910042 11.216148    11 -0.2161477 0.04671984
15 5.154571  9.845226    11  1.1547737 1.33350233
16 6.090791  9.356569    11  1.6434310 2.70086548
17 5.153109 10.137007    11  0.8629927 0.74475644
18 4.059393  9.513671    11  1.4863290 2.20917378
19 5.565751  9.480360    11  1.5196399 2.30930530
20 2.473582 11.151987    11 -0.1519869 0.02310002

Why is the mean the best guess?

The column how_wrong2 is equal to (guess - sales)^2, i.e., the guess minus the real data, squared.

How wrong was I in total?

sum(reiDataSmall$how_wrong2)
[1] 21.81172

Why is the mean the best guess?

I’ll guess something else

guess <- 9
reiDataSmall$guess <- guess
reiDataSmall
     weight     sales guess  how_wrong how_wrong2
1  3.160536 10.069211     9  0.9307886 0.86636747
2  4.970014 10.131396     9  0.8686041 0.75447313
3  6.759033  8.856539     9  2.1434609 4.59442468
4  3.068526 10.783086     9  0.2169135 0.04705148
5  5.254373 11.149486     9 -0.1494865 0.02234620
6  4.769999 10.239619     9  0.7603810 0.57817920
7  3.831546 10.078726     9  0.9212739 0.84874557
8  3.805244 10.128758     9  0.8712418 0.75906230
9  3.172416 10.696915     9  0.3030852 0.09186061
10 2.901195  9.825507     9  1.1744935 1.37943490
11 2.461316 10.569462     9  0.4305384 0.18536330
12 2.737338 10.440997     9  0.5590028 0.31248410
13 2.982250  9.584194     9  1.4158062 2.00450713
14 5.910042 11.216148     9 -0.2161477 0.04671984
15 5.154571  9.845226     9  1.1547737 1.33350233
16 6.090791  9.356569     9  1.6434310 2.70086548
17 5.153109 10.137007     9  0.8629927 0.74475644
18 4.059393  9.513671     9  1.4863290 2.20917378
19 5.565751  9.480360     9  1.5196399 2.30930530
20 2.473582 11.151987     9 -0.1519869 0.02310002

Why is the mean the best guess?

How wrong was I?

  • Subtract the real data from the guess
reiDataSmall$how_wrong <- reiDataSmall$guess - reiDataSmall$sales
  • Square them to make them all positive
reiDataSmall$how_wrong2 <- (reiDataSmall$guess - reiDataSmall$sales) ^ 2
reiDataSmall
     weight     sales guess  how_wrong how_wrong2
1  3.160536 10.069211     9 -1.0692114 1.14321296
2  4.970014 10.131396     9 -1.1313959 1.28005662
3  6.759033  8.856539     9  0.1434609 0.02058103
4  3.068526 10.783086     9 -1.7830865 3.17939738
5  5.254373 11.149486     9 -2.1494865 4.62029208
6  4.769999 10.239619     9 -1.2396190 1.53665537
7  3.831546 10.078726     9 -1.0787261 1.16365003
8  3.805244 10.128758     9 -1.1287582 1.27409505
9  3.172416 10.696915     9 -1.6969148 2.87952000
10 2.901195  9.825507     9 -0.8255065 0.68146104
11 2.461316 10.569462     9 -1.5694616 2.46320975
12 2.737338 10.440997     9 -1.4409972 2.07647300
13 2.982250  9.584194     9 -0.5841938 0.34128242
14 5.910042 11.216148     9 -2.2161477 4.91131075
15 5.154571  9.845226     9 -0.8452263 0.71440747
16 6.090791  9.356569     9 -0.3565690 0.12714145
17 5.153109 10.137007     9 -1.1370073 1.29278555
18 4.059393  9.513671     9 -0.5136710 0.26385794
19 5.565751  9.480360     9 -0.4803601 0.23074587
20 2.473582 11.151987     9 -2.1519869 4.63104763

Why is the mean the best guess?

The column how_wrong2 is equal to (guess - sales)^2, i.e., the guess minus the real data, squared.

How wrong was I in total?

sum(reiDataSmall$how_wrong2)
[1] 34.83118

Why is the mean the best guess?

To see if that’s the best guess we could have made:

  • Let’s try some others!
  • Let’s simulate a thousand other guesses:
    • From the minimum sales to the maximum
# Set up simulation with a data frame to hold results
sims <- 1000
simulation.results <- data.frame(
  guess = c(mean(reiDataSmall$sales),
            seq(length.out = sims-1, 
              from = min(reiDataSmall$sales),
              to = max(reiDataSmall$sales))),
  how_wrong2 = rep(NA, times = sims)
)

Why is the mean the best guess?

  • Run simulation with for() loop
for( i in 1:sims){
  simulation.results$how_wrong2[i] <- sum(
    (simulation.results$guess[i] - reiDataSmall$sales) ^2 )
}

Why is the mean the best guess?

  • What does it look like if I plot how wrong I was by what my guess was?
ggplot(data = simulation.results,
       aes(x = guess,
           y = how_wrong2)) +
  geom_point()

Why is the mean the best guess?

  • What is the minimum amount I was wrong?
min(simulation.results$how_wrong2)
[1] 7.791746
  • What was my guess at that point?
lowest.error <- min(simulation.results$how_wrong2) # the smallest total squared error
simulation.results[simulation.results$how_wrong2 == lowest.error,]
     guess how_wrong2
1 10.16274   7.791746
  • And what was the mean again?
mean(reiDataSmall$sales)
[1] 10.16274

Why is the mean the best guess?

ggplot(data = simulation.results,
       aes(x = guess,
           y = how_wrong2)) +
  geom_point() + 
  geom_point(data = simulation.results[simulation.results$guess == mean(reiDataSmall$sales),],
             aes(x = guess,
                 y = how_wrong2),
             color = 'red',
             size = 4)

Fun fact:

  • The median will minimize the sum of absolute error:
simulation.results.median <- data.frame(
  guess = c(median(reiDataSmall$sales),
            seq(length.out = sims-1, 
              from = min(reiDataSmall$sales),
              to = max(reiDataSmall$sales))),
  how_wrong = rep(NA, times = sims)
)

for( i in 1:sims){
  simulation.results.median$how_wrong[i] <- sum(abs(simulation.results.median$guess[i] - reiDataSmall$sales))
}

Fun fact:

  • The median will minimize the sum of absolute error:
ggplot(data = simulation.results.median,
       aes(x = guess,
           y = how_wrong)) +
  geom_point()

Fun fact:

The median will minimize the sum of absolute error:

  • What is the minimum amount I was wrong?
min(simulation.results.median$how_wrong)
[1] 9.777342
  • What was my guess at that point?
lowest.error.median <- min(simulation.results.median$how_wrong) # the smallest total absolute error
simulation.results.median[simulation.results.median$how_wrong == lowest.error.median,]
       guess how_wrong
1   10.13008  9.777342
541 10.13092  9.777342

Linear Regression

If we did not know a shoe’s weight, what would be our best guess of sales?

The mean sales!

mean(reiDataSmall$sales)
[1] 10.16274

Linear Regression

Graphically, that “guess” looks like this:

ggplot(reiDataSmall,
       aes(x = 1, 
           y = sales)) +
  geom_point() +
  geom_hline(yintercept = mean(reiDataSmall$sales))

Linear Regression

But what if we have information about their weight?

ggplot(reiDataSmall,
       aes(x = weight, 
           y = sales)) +
  geom_point()

Linear Regression

  • How does our old guess look?
ggplot(reiDataSmall,
       aes(x = weight, 
           y = sales)) +
  geom_point() +
  geom_hline(yintercept = mean(reiDataSmall$sales))

Linear Regression

  • That flat line is no longer our best guess, is it?
  • In this case, the mean is no longer the best possible guess
    • This is because there are large residuals between the line we guess and many points
      • Especially on the edges

Linear Regression

Instead, what is our new best guess?

  • It’s the one that minimizes the sum of squared residuals, given that we now know weight
  • We can find it the same way we found the best guess without weight
  • But rather than guessing one number, we’ll guess two: an intercept and a slope for a line through the data

Linear Regression

What is our new best guess?

  • I’m going to test a bunch of different intercepts and slopes
sims <- 10000
simulation.results <- data.frame(
  # Random intercepts spanning the observed sales, and slopes between -1 and 1
  guess.intercept = runif(n = sims,
                          min = min(reiDataSmall$sales),
                          max = max(reiDataSmall$sales)),
  guess.slope = runif(n = sims,
                      min = -1,
                      max = 1),
  how_wrong2 = rep(NA, times = sims)
)

for( i in 1:sims){
  # Predicted sales for every shoe from this intercept/slope pair
  prediction <- simulation.results$guess.intercept[i] +
    simulation.results$guess.slope[i] * reiDataSmall$weight
  # Total squared error for this pair
  simulation.results$how_wrong2[i] <- sum((prediction - reiDataSmall$sales)^2)
}

Linear Regression

What is our new best guess?

  • I’m going to test a bunch of different intercepts and slopes
  • What gets us the lowest squared error?
simulation.results[simulation.results$how_wrong2 == min(simulation.results$how_wrong2),]
With intercept and slope ranges that bracket the data (as above), the winning pair should land close to the exact values lm() reports below (an intercept near 10.9 and a slope near -0.18), with a total squared error only slightly above the minimum possible.

To get the exact numbers

  • Use lm() to find the coefficient for weight
lm(data = reiDataSmall,
   formula = sales ~ weight)

Call:
lm(formula = sales ~ weight, data = reiDataSmall)

Coefficients:
(Intercept)       weight  
    10.9061      -0.1764  

Linear Regression

That’s how we end up with this line:

Linear Regression

These best guesses are now different at each level of weight:

weight.lm <- lm(data = reiDataSmall,
   formula = sales ~ weight)
reiDataSmall$bestguess.lm <- weight.lm$coef["(Intercept)"] + 
  weight.lm$coef["weight"]  * reiDataSmall$weight
reiDataSmall$how_wrong2.lm <- ( reiDataSmall$bestguess.lm - reiDataSmall$sales)^2

reiDataSmall[, c('weight', 'bestguess.lm')]
     weight bestguess.lm
1  3.160536    10.348593
2  4.970014    10.029385
3  6.759033     9.713786
4  3.068526    10.364824
5  5.254373     9.979221
6  4.769999    10.064669
7  3.831546    10.230221
8  3.805244    10.234861
9  3.172416    10.346497
10 2.901195    10.394343
11 2.461316    10.471942
12 2.737338    10.423249
13 2.982250    10.380044
14 5.910042     9.863555
15 5.154571     9.996827
16 6.090791     9.831670
17 5.153109     9.997085
18 4.059393    10.190026
19 5.565751     9.924291
20 2.473582    10.469778
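
As an aside, R can produce these same best guesses for us. A minimal sketch using the weight.lm model above and a hypothetical new shoe with weight 4:

fitted(weight.lm)                                     # best guess for every row; matches bestguess.lm above
predict(weight.lm, newdata = data.frame(weight = 4))  # best guess for a new shoe weighing 4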

Linear Regression

And what are the residuals if we take weight into account?

ggplot(reiDataSmall,
       aes(x = weight, 
           y = sales)) +
  geom_point() +
  geom_smooth(formula = 'y~x',
              method = 'lm', 
              se = F, color = 'black') +
  geom_segment(aes(xend = weight, yend = bestguess.lm), 
               color = "red",
               linetype = 'dashed',
               size = 1) +
  geom_point() 

Linear Regression

And what are the residuals if we take weight into account?

reiDataSmall[, c('weight', 'bestguess.lm', 'how_wrong2.lm')]
     weight bestguess.lm how_wrong2.lm
1  3.160536    10.348593  0.0780539703
2  4.970014    10.029385  0.0104062769
3  6.759033     9.713786  0.7348715017
4  3.068526    10.364824  0.1749433569
5  5.254373     9.979221  1.3695209324
6  4.769999    10.064669  0.0306075101
7  3.831546    10.230221  0.0229505609
8  3.805244    10.234861  0.0112577147
9  3.172416    10.346497  0.1227925882
10 2.901195    10.394343  0.3235748427
11 2.461316    10.471942  0.0095101693
12 2.737338    10.423249  0.0003150111
13 2.982250    10.380044  0.6333775757
14 5.910042     9.863555  1.8295062522
15 5.154571     9.996827  0.0229828344
16 6.090791     9.831670  0.2257205210
17 5.153109     9.997085  0.0195782530
18 4.059393    10.190026  0.4574564464
19 5.565751     9.924291  0.1970749260
20 2.473582    10.469778  0.4654095167
sum(reiDataSmall$how_wrong2.lm)
[1] 6.739911
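
For reference, lm() already stores these residuals, so the same total can be pulled straight from the model object (it should equal the sum above):

sum(weight.lm$residuals ^ 2) # sum of squared residuals from the fitted line
deviance(weight.lm)          # same number; the 'deviance' of an lm is its residual sum of squares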

Linear Regression

Is that smaller than before?

SSE from mean:

sum((mean(reiDataSmall$sales) - reiDataSmall$sales) ^ 2)
[1] 7.791746

SSE including weight:

sum(reiDataSmall$how_wrong2.lm)
[1] 6.739911

But is it worth drawing that line?

  • Is the reduction in squared error worth it?
  • We’re always going to reduce error when we add more things into a model
  • But we only want to add things that are really worth it
    • Overfitting old data makes us likely to make worse predictions in the future
    • Complexity makes our results hard to explain

But is it worth drawing that line?

ggplot(reiDataSmall,
       aes(x = weight, 
           y = sales)) +
  geom_point() +
  geom_hline(yintercept = mean(reiDataSmall$sales))

ggplot(reiDataSmall,
       aes(x = weight, 
           y = sales)) +
  geom_point() +
  geom_smooth(formula = 'y~x',
              method = 'lm', 
              se = F)

Testing Linear Regression

Use summary() around lm() to find the coefficient and p-value for weight

summary(lm(data = reiDataSmall,
   formula = sales ~ weight))

Call:
lm(formula = sales ~ weight, data = reiDataSmall)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.85725 -0.45172 -0.04418  0.21882  1.35259 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  10.9061     0.4642  23.496 5.87e-15 ***
weight       -0.1764     0.1053  -1.676    0.111    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6119 on 18 degrees of freedom
Multiple R-squared:  0.135, Adjusted R-squared:  0.08694 
F-statistic: 2.809 on 1 and 18 DF,  p-value: 0.111
  • What is the meaning of this p-value?
    • It’s the same as any p-value:
      • “What is the probability of seeing a slope this far (or more) from zero, if there were no true relationship between weight and sales?”
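
If you want that slope and its p-value as numbers rather than reading them off the printout, one way (a sketch):

weight.summary <- summary(lm(sales ~ weight, data = reiDataSmall))
coef(weight.summary)                         # the coefficient table as a matrix
coef(weight.summary)["weight", "Estimate"]   # the slope (-0.176 above)
coef(weight.summary)["weight", "Pr(>|t|)"]   # its p-value (0.111 above)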

Linear Regression

What (I think) you should take away from this lesson if nothing else.

  • All linear regression is doing is answering the question:
    • “If I know something about variable x, what is my best guess about the value of y?”
    • Specifically: what is the best line I can draw through my data?
  • And for the most part, “fancier” statistics essentially boil down to linear regression.

Linear Regression with Discrete/Group Predictors

Discrete/Group Predictors

We can use regression as just an extension of a t-test

  • Does cushion impact sales?
  • Let’s simplify this by removing the “Medium” cushion group
reiData <- reiData[reiData$cushion != "Medium",]

First, let’s plot this

Discrete/Group Predictors

Does cushion impact sales?

Plot!

reiData |>
  group_by(cushion) |>
  summarise(sales = mean(sales, na.rm = T)) |> # Not sure if we have missing data
  ggplot(
    aes(x = cushion,
        y = sales)) +
  geom_bar(stat = "identity")

Discrete/Group Predictors

What we are basically doing is comparing these plots:

And answering the question:

Is it worth knowing that we can separate the data into these two groups?

Group Predictors

So let’s see!

Using the lm() function in R:

lm( 
  data = reiData, 
  formula = sales ~ cushion ) |> # The dv is sales, IV is "cushion"
  summary() # Summarise it

Call:
lm(formula = sales ~ cushion, data = reiData)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.3956 -0.4987  0.0878  0.6200  3.1414 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 10.32820    0.04648 222.193   <2e-16 ***
cushionLow  -0.69344    0.07466  -9.288   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9583 on 692 degrees of freedom
Multiple R-squared:  0.1108,    Adjusted R-squared:  0.1096 
F-statistic: 86.26 on 1 and 692 DF,  p-value: < 2.2e-16

What do you notice about this result, compared to the one from the t-test?

t.test( 
  reiData[reiData$cushion == 'Low', 'sales'], 
  reiData[reiData$cushion == 'High', 'sales']) 

    Welch Two Sample t-test

data:  reiData[reiData$cushion == "Low", "sales"] and reiData[reiData$cushion == "High", "sales"]
t = -8.9923, df = 509.88, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.8449394 -0.5419355
sample estimates:
mean of x mean of y 
 9.634764 10.328201 

It’s the same thing!

Discrete/Group Predictors

Why is the result the same thing?

  • Because a regression and a t-test are mostly the same thing
    • Take some data
    • Compare the difference in two groups to the difference that would happen from chance alone
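
In fact, if you re-run the t-test with var.equal = TRUE (the pooled-variance version, which matches the equal-variance assumption lm() makes), the t statistic and p-value should line up with the cushionLow row above; the Welch test we ran differs slightly only because it lets the two groups have different variances:

t.test(reiData[reiData$cushion == 'Low', 'sales'],
       reiData[reiData$cushion == 'High', 'sales'],
       var.equal = TRUE)  # pooled variances: equivalent to lm(sales ~ cushion) with two groups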

Discrete/Group Predictors

Let’s extend this.

Read the entire data back in, which has the third group

reiData <- read.csv('rei_products.csv')

Discrete/Group Predictors

I want to know:

  • If cushion is worth knowing for sales:
    • If the three levels differ from each other

Answer

Null hypothesis: That there is no difference in sales between cushion levels.

lm(
  data = reiData, 
  sales ~ cushion) |>
  summary()

Call:
lm(formula = sales ~ cushion, data = reiData)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.3956 -0.4759  0.0694  0.5737  3.1414 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)   10.32820    0.04280 241.313  < 2e-16 ***
cushionLow    -0.69344    0.06875 -10.087  < 2e-16 ***
cushionMedium -0.28409    0.06485  -4.381 1.31e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8823 on 1019 degrees of freedom
Multiple R-squared:  0.09084,   Adjusted R-squared:  0.08905 
F-statistic: 50.91 on 2 and 1019 DF,  p-value: < 2.2e-16
reiData |>
  group_by(cushion) |>
  summarise(sales = mean(sales, na.rm = T)) |>
  ggplot(
    aes(x = cushion,
        y = sales)) +
  geom_bar(stat = 'identity', fill = c('seagreen', 'goldenrod', 'seagreen3')) +
  theme_bw()

Conclusion: There is likely a difference between cushion levels, as the probability that a difference this large arises by chance alone is < .001%.
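
One way to read that coefficient table: the intercept is the mean of the reference level (High, the first level alphabetically), and each cushion coefficient is a difference from it. A quick sketch:

cushion.lm <- lm(sales ~ cushion, data = reiData)
coef(cushion.lm)["(Intercept)"]                                     # mean sales for High cushion (~10.33)
coef(cushion.lm)["(Intercept)"] + coef(cushion.lm)["cushionLow"]    # mean sales for Low cushion (~9.63, the t-test 'mean of x')
coef(cushion.lm)["(Intercept)"] + coef(cushion.lm)["cushionMedium"] # mean sales for Medium cushion (~10.04)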

Answer

To make the plot less gross

reiData$cushion <- factor(reiData$cushion, levels = c("Low", "Medium", "High"))
lm(
  data = reiData, 
  sales ~ cushion) |>
  summary()

Call:
lm(formula = sales ~ cushion, data = reiData)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.3956 -0.4759  0.0694  0.5737  3.1414 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)    9.63476    0.05380  179.09  < 2e-16 ***
cushionMedium  0.40935    0.07258    5.64  2.2e-08 ***
cushionHigh    0.69344    0.06875   10.09  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8823 on 1019 degrees of freedom
Multiple R-squared:  0.09084,   Adjusted R-squared:  0.08905 
F-statistic: 50.91 on 2 and 1019 DF,  p-value: < 2.2e-16
reiData |>
  group_by(cushion) |>
  summarise(sales = mean(sales, na.rm = T)) |>
  ggplot(
    aes(x = cushion,
        y = sales)) +
  geom_bar(stat = 'identity', fill = c('seagreen', 'goldenrod', 'seagreen3')) +
  theme_bw()

Discrete/Group Predictors

But what if I want to know:

  • If it is worth knowing if cushion is “High” specifically
    • Vs. Low and Medium
    • You’ll have to construct a “dummy code”
    • Google this
  • Test with lm()
  • Plot the condition means

Discrete/Group Predictors

Answer

Null hypothesis: That there is no difference in sales between cushion high and the combination of low and medium.

reiData$cushionHigh <- ifelse(reiData$cushion == "High", 1, 0)

lm(
  data = reiData, 
  sales ~ cushionHigh) |>
  summary()

Call:
lm(formula = sales ~ cushionHigh, data = reiData)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.6205 -0.4556  0.0827  0.5721  3.1414 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  9.85967    0.03665 268.998  < 2e-16 ***
cushionHigh  0.46853    0.05684   8.243 5.12e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8956 on 1020 degrees of freedom
Multiple R-squared:  0.06246,   Adjusted R-squared:  0.06154 
F-statistic: 67.95 on 1 and 1020 DF,  p-value: 5.116e-16
reiData |>
  group_by(cushionHigh) |>
  summarise(sales = mean(sales, na.rm = T)) |>
  ggplot(
    aes(x = cushionHigh,
        y = sales)) +
  geom_bar(stat = 'identity', fill = c('seagreen', 'goldenrod')) +
  theme_bw()

Conclusion: There is likely a difference.

Discrete/Group Predictors

But what if I want to know whether the relationship between the three conditions is linear?

  • i.e., Low is worst, then medium, then high?
  • We can construct a contrast code
reiData$cushionLinear <- ifelse( reiData$cushion == "Low", -1,
                                 ifelse(reiData$cushion == "Medium", 0, 1))

summary(lm(data = reiData, sales ~ cushionLinear))

Call:
lm(formula = sales ~ cushionLinear, data = reiData)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.4211 -0.4745  0.0694  0.5697  3.1253 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)   10.00230    0.02809  356.11   <2e-16 ***
cushionLinear  0.34204    0.03408   10.04   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8824 on 1020 degrees of freedom
Multiple R-squared:  0.08985,   Adjusted R-squared:  0.08896 
F-statistic: 100.7 on 1 and 1020 DF,  p-value: < 2.2e-16

Linear Regression: Controlling for variables

Controlling for variables

Sometimes we think that some… other variable predicts our outcome, and we want to take its effect away.

  • e.g., If we thought shoes sold more when they had high cushion, but we know that cushion affects weight
  • To “control” for weight, we can add (+) it to our model:
lm(
  data = reiData, 
  formula = sales ~ cushion + weight) |>
  summary()

Call:
lm(formula = sales ~ cushion + weight, data = reiData)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.1702 -0.4305  0.0788  0.5436  3.3486 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)    8.63808    0.11750  73.516   <2e-16 ***
cushionMedium  0.06406    0.07865   0.815    0.416    
cushionHigh   -0.13318    0.10961  -1.215    0.225    
weight         0.41007    0.04343   9.442   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8465 on 1018 degrees of freedom
Multiple R-squared:  0.164, Adjusted R-squared:  0.1616 
F-statistic: 66.59 on 3 and 1018 DF,  p-value: < 2.2e-16

Controlling for variables

Rules for controlling for variables:

First:

  • Test the effect of the control on the DV alone
lm(
  data = reiData, 
  formula = sales ~ weight) |>
  summary()

Call:
lm(formula = sales ~ weight, data = reiData)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.1616 -0.4220  0.0828  0.5490  3.2713 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  8.78927    0.09499   92.53   <2e-16 ***
weight       0.35751    0.02577   13.87   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8484 on 1020 degrees of freedom
Multiple R-squared:  0.1587,    Adjusted R-squared:  0.1579 
F-statistic: 192.4 on 1 and 1020 DF,  p-value: < 2.2e-16
  • If it is meaningful, it’s worth controlling for
  • If the control does not predict the DV, it can’t affect our result

Controlling for variables

Rules for controlling for variables:

Second:

  • Test the effect of the IV on the control
lm(
  data = reiData, 
  weight ~ cushion) |>
  summary()

Call:
lm(formula = weight ~ cushion, data = reiData)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.0509 -0.3196 -0.0154  0.2716  3.0308 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)    2.43052    0.03723   65.29   <2e-16 ***
cushionMedium  0.84201    0.05023   16.77   <2e-16 ***
cushionHigh    2.01580    0.04757   42.37   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6106 on 1019 degrees of freedom
Multiple R-squared:  0.6494,    Adjusted R-squared:  0.6487 
F-statistic: 943.7 on 2 and 1019 DF,  p-value: < 2.2e-16

Controlling for variables

If the control predicts the DV AND the IV predicts the control:

  • We say the “IV is confounded with the control variable”
    • It is harder to disentangle:
      • “IV leads to DV” from
      • “Control leads to DV, and that is why it looks like IV leads to DV”

If the control predicts the DV BUT the IV DOES NOT predict the control:

  • We include the control, and it makes our IV appear more reliable
    • It reduces the amount of variance in the DV that our IV has to explain

Controlling for variables

To know from this:

  • What a “control variable” is
  • What two steps we perform to decide whether to include a control
  • What if the control predicts our DV?
  • What if the IV also predicts our control?

Linear Regression Wrap Up

  • We’re basically always drawing lines through points in statistics:
    • Logistic regression:
      • Lines, but they have to be between 0 and 1
    • Lasso Regression (machine learning):
      • Lines, but a little more conservative
    • Random Forests (machine learning):
      • Lines, but lots of lines, and just to a single end point

Correlation

Correlation

Are these scatterplots showing strong relationships?

Two ways to know with numbers:

  • Regression
    • Using lm() in R
    • Bueno
  • Correlations:
    • Using cor() in R

Correlation Coefficients

Specifically, this is a Pearson correlation coefficient

  • Often abbreviated with r.
  • This is a continuous metric between -1 and 1.
    • Perfectly positive (x goes up, y goes up the same): +1
    • Perfectly negative: -1
plot(c(1:50), 
     c(51:100))

plot(c(50:1), 
     c(51:100))
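
cor() puts a number on what those two toy plots show; both relationships are perfect:

cor(c(1:50), c(51:100)) # exactly  1: perfectly positive
cor(c(50:1), c(51:100)) # exactly -1: perfectly negative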

Correlation Matrices

For more than two variables, you can compute the correlations between all pairs x, y at once as a correlation matrix.

  • Let’s look at one:
cor(reiData[, c("sales", "price", "weight")], use = 'pairwise.complete.obs') |>
  round(3)
        sales  price weight
sales   1.000 -0.132  0.398
price  -0.132  1.000  0.747
weight  0.398  0.747  1.000

Correlation Coefficients

Problem with correlation:

We can’t make predictions with correlations

  • We can say “y increases as x increases” generally
    • But not “y increases this much as x increases this much”; for that, we need a regression slope (sketched below)
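
Regression supplies the “this much”: the slope is just the correlation rescaled by the two variables’ spreads. A sketch with the REI data; the result should match the weight coefficient from lm(sales ~ weight, data = reiData), up to rounding and missing-data handling:

r <- cor(reiData$sales, reiData$weight, use = 'pairwise.complete.obs')
r * sd(reiData$sales, na.rm = TRUE) / sd(reiData$weight, na.rm = TRUE) # the regression slope of sales on weight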